🦋 Changeset detected
Latest commit: 2a1b9a0
The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
Pull request overview
Adds a local “parser” CLI to crawl a site, extract page metadata, and persist results to disk with URL/filename sanitization.
Changes:
- Introduces a Puppeteer-based crawler/metadata extractor and local JSON output.
- Adds URL normalization + filesystem-safe filename sanitization utilities.
- Adds build/test tooling (TypeScript + Jest) and an error-handling integration test.
Reviewed changes
Copilot reviewed 16 out of 18 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| apps/parser/src/parser.ts | Adds the CLI entrypoint: reachability check, crawl orchestration, page parsing, persistence. |
| apps/parser/src/modules/crawler.ts | Implements recursive crawl + scope filtering + link discovery. |
| apps/parser/src/modules/domActions.ts | Expands interactive UI sections before scraping text/links. |
| apps/parser/src/modules/config.ts | Resolves env-based configuration and output directory derivation. |
| apps/parser/src/modules/output.ts | Creates output directory and writes JSON snapshots. |
| apps/parser/src/modules/errors.ts | Centralizes fatal error handling/exit code. |
| apps/parser/src/modules/types.ts | Adds typed metadata/node structures for crawl results. |
| apps/parser/src/utils/url.ts | Adds URL normalization helpers and remote URL detection. |
| apps/parser/src/utils/sanitizeFilename.ts | Adds filesystem-safe filename sanitization. |
| apps/parser/tests/parser.error-handling.test.ts | Adds integration test for unreachable/nonexistent URL behavior. |
| apps/parser/package.json | Adds build/parse/test scripts and required dependencies. |
| apps/parser/jest.config.ts | Configures Jest + ts-jest for the parser app. |
| apps/parser/tsconfig.json | Adds parser app TS config for dev/test typechecking. |
| apps/parser/tsconfig.build.json | Adds build TS config emitting to dist/. |
| apps/parser/README.md | Documents CLI usage, env vars, and tests. |
| .changeset/wide-hairs-fail.md | Changeset entry for the new parser feature. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ng of the same document
…a/developer-portal into CAI-749-parser-url-crawler
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…le navigation_timeout parameter. Remove unused isRemoteUrl
Co-authored-by: MarBert <41899883+MarBert@users.noreply.github.com>
… code and add new helper for dates
if (!input) {
  return DEFAULT_REPLACEMENT;
I suggest adding a warning log here, so that we can tell when something went wrong and DEFAULT_REPLACEMENT was used.
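A minimal sketch of what this suggestion could look like. The value of DEFAULT_REPLACEMENT, the function name sanitizeFilename, and the set of replaced characters are assumptions for illustration; the real implementation in apps/parser/src/utils/sanitizeFilename.ts may differ.

```typescript
// Assumed value: the PR excerpt does not show what DEFAULT_REPLACEMENT is.
const DEFAULT_REPLACEMENT = "_";

// Hypothetical helper: returns a filesystem-safe name and warns on fallback.
function sanitizeFilename(input: string | undefined | null): string {
  if (!input) {
    // The warning makes the fallback visible in the logs instead of silent.
    console.warn(
      `sanitizeFilename: empty input, falling back to "${DEFAULT_REPLACEMENT}"`
    );
    return DEFAULT_REPLACEMENT;
  }
  // Replace characters that are commonly rejected by filesystems.
  return input.replace(/[<>:"\/\\|?*]/g, DEFAULT_REPLACEMENT);
}
```

With this in place, a run that produces many fallback names shows up as repeated warnings rather than as unexplained files named after the default.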
  return candidate;
}

export const UrlWithoutAnchors = (rawUrl: string): string => {
I suggest a name like RemoveAnchorsFromUrl; it makes it easier to understand what this does.
apps/parser/src/modules/crawler.ts
Outdated
try {
  page = await browser.newPage();
  await page.goto(node.url, {
    waitUntil: "networkidle2",
can we use a constant for this?
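One way to address this, sketched under assumptions: the constant names NAVIGATION_WAIT_UNTIL and NAVIGATION_TIMEOUT_MS, the 30-second value, and the helper buildNavigationOptions are hypothetical and do not come from the PR. The four lifecycle event strings are the ones Puppeteer accepts for waitUntil.

```typescript
// Lifecycle events accepted by Puppeteer's page.goto `waitUntil` option.
type WaitUntilEvent = "load" | "domcontentloaded" | "networkidle0" | "networkidle2";

// Assumed names and values, defined once instead of inline at the call site.
const NAVIGATION_WAIT_UNTIL: WaitUntilEvent = "networkidle2";
const NAVIGATION_TIMEOUT_MS = 30_000;

interface NavigationOptions {
  waitUntil: WaitUntilEvent;
  timeout: number;
}

// Hypothetical helper: builds the options object passed to page.goto.
function buildNavigationOptions(
  timeoutMs: number = NAVIGATION_TIMEOUT_MS
): NavigationOptions {
  return { waitUntil: NAVIGATION_WAIT_UNTIL, timeout: timeoutMs };
}
```

The call site could then read `await page.goto(node.url, buildNavigationOptions())`, which keeps the wait strategy and the timeout in one place.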
apps/parser/src/modules/crawler.ts
Outdated
  url.hash = "";
  return UrlWithoutAnchors(url.toString());
} catch (_error) {
  return rawUrl;
I suggest adding a warning here.
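A rough sketch of the catch branch with a warning added. It uses the renamed helper suggested earlier in this review (lower-cased here as removeAnchorsFromUrl, which is an assumption), and the log wording is illustrative only.

```typescript
// Hypothetical version of the helper: strips the fragment from a URL and
// warns, rather than failing silently, when the input cannot be parsed.
function removeAnchorsFromUrl(rawUrl: string): string {
  try {
    const url = new URL(rawUrl);
    url.hash = ""; // an empty hash removes the "#..." part entirely
    return url.toString();
  } catch (error) {
    // Keep the original fallback, but leave a trace in the logs.
    console.warn(`Could not parse URL "${rawUrl}"; returning it unchanged.`, error);
    return rawUrl;
  }
}
```

The return value is unchanged with respect to the reviewed code; only the silent fallback becomes visible.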
apps/parser/src/parser.ts
Outdated
  }
}

async function persistSnapshot(snapshot: ParsedMetadata, FILENAME_LENGTH_THRESHOLD: number): Promise<void> {
Isn't FILENAME_LENGTH_THRESHOLD a constant that is reachable inside this function? Why do you need to pass it as a parameter?
apps/parser/src/parser.ts
Outdated
0,
env.maxDepth,
parsedPages,
parsePageFn,
Why do you have to pass the function here? Can't we export the function and call it inside parsePages?
MarBert left a comment
Add a log with the text "Completed parsing of page [...]. Found [...] links. [nr links parsed/nr links to parse]" to monitor the program's progress;
Remove the trailing dash from the .json file names;
Handle the domain issue by splitting at the first slash after https:// and checking that the result equals the baseUrl or {valid_variant}.domain, and add a TODO for a more general solution;
Lower FILENAME_LENGTH_THRESHOLD to 250, so that there is room for the extension in the filename;
Make the name of the folder where the JSON files are saved look like www.uqido.com, without the leading http...;
Name the homepage's file _.json, then ask Ciri to confirm whether specific names are needed;
Rename the function parsePages to exploreAndParsePages;
Rename the file parser.ts to main.ts and move all functions, except the one that launches the recursion, into the helpers;
Rename crawler.ts to parser.ts and move the parsePageFn function there, renamed to generatePageParsedMetadata;
Check which parameters do not need to be passed to the recursion and export them as a new recursionMetadata object;
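The domain check requested in the list above could be sketched roughly as follows. The names extractHost and isInScope are hypothetical, and the sketch assumes that baseUrl carries the bare domain (for example https://example.com) and that each valid variant is a single subdomain label; the real scope logic may differ.

```typescript
// Hypothetical helper: takes everything between the scheme and the first slash.
function extractHost(url: string): string {
  const withoutScheme = url.replace(/^https?:\/\//, "");
  const [host = ""] = withoutScheme.split("/");
  return host.toLowerCase();
}

// Hypothetical scope check: a URL is in scope when its host equals the base
// host or "<variant>.<base host>" for one of the configured variants.
// TODO: replace with a more general solution, as requested in the review.
function isInScope(
  url: string,
  baseUrl: string,
  validDomainVariants: string[]
): boolean {
  const host = extractHost(url);
  const baseHost = extractHost(baseUrl);
  if (host === baseHost) {
    return true;
  }
  return validDomainVariants.some(
    (variant) => host === `${variant.toLowerCase()}.${baseHost}`
  );
}
```

For example, with baseUrl https://example.com and variants ["www"], the URL https://www.example.com/about is in scope while https://blog.example.com/ is not.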
- Update sanitizeUrlAsFilename to support length threshold and hash suffix for long URLs.
- Modify resolveEnv to parse validDomainVariants from environment variables.
- Refactor parsePages to utilize validDomainVariants for scope checking.
- Update EnvConfig type to include validDomainVariants.
- Improve tests for sanitizeUrlAsFilename to cover new functionality.
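As an illustration of the first item, here is a sketch of a length threshold combined with a hash suffix. The signature, the SHA-256 choice, the 8-character suffix, and the replacement rules are all assumptions; only the function name and the value 250 come from this conversation.

```typescript
import { createHash } from "node:crypto";

const FILENAME_LENGTH_THRESHOLD = 250; // value requested in the review
const HASH_SUFFIX_LENGTH = 8; // assumed length of the hex suffix

// Hypothetical sketch: turns a URL into a filesystem-safe name and, when the
// name is too long, truncates it and appends a short hash of the full URL so
// that different long URLs still map to different names.
function sanitizeUrlAsFilename(
  rawUrl: string,
  lengthThreshold: number = FILENAME_LENGTH_THRESHOLD
): string {
  const sanitized = rawUrl
    .replace(/^https?:\/\//, "") // drop the scheme
    .replace(/[^a-zA-Z0-9._-]+/g, "-") // collapse unsafe runs into one dash
    .replace(/-+$/, ""); // drop the trailing dash
  if (sanitized.length <= lengthThreshold) {
    return sanitized;
  }
  const hash = createHash("sha256")
    .update(rawUrl)
    .digest("hex")
    .slice(0, HASH_SUFFIX_LENGTH);
  const head = sanitized.slice(0, lengthThreshold - HASH_SUFFIX_LENGTH - 1);
  return `${head}-${hash}`;
}
```

Keeping the result at or below the threshold leaves room for the .json extension within the usual 255-character filename limit.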
…me function so that for root it returns the hostname
… UrlWithoutAnchors to RemoveAnchorsFromUrl
…,remove constants from recursion parameters and enhance helpers
Branch is not up to date with base branch
@anemone008 it seems this Pull Request is not updated with base branch.
…mited depth, implement base scope handling, and enhance URL sanitization functions for Directory names
Jira Pull Request Link
This Pull Request refers to the following Jira issue CAI-749
This PR exceeds the recommended size of 800 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.
List of Changes
Add a parsing script to the parser app. Parsed content is saved locally. URLs are sanitized for the filesystem and used as file names.
Motivation and Context
How Has This Been Tested?
Tested for errors associated with non-existent or unreachable URLs. Reproducible through npm test, as described in the README.md.
Screenshots (if appropriate):
Types of changes
Checklist: